Performance Comparison of Apache Spark and Tez for Entity Resolution
Authors
Abstract
Entity Resolution is one of the most active topics in the field of Big Data. It finds duplicate records in datasets, that is, records that refer to the same real-world entity. Algorithms that perform Entity Resolution are computationally intensive and time-consuming, especially on large datasets. Much research has gone into improving Entity Resolution, and a number of algorithms have been developed in an attempt to reduce its execution time on a given dataset. The efficiency of Entity Resolution algorithms has improved significantly but is still not adequate for large datasets in the Big Data field. We contribute to improving its running time not by changing the algorithm but by finding the most suitable platform on which to run it. This, in turn, increases efficiency and can indirectly raise the accuracy of Entity Resolution by making it feasible to run more computationally intensive algorithms. We shortlisted Apache Spark (RDD, DataFrame and Dataset) and Apache Tez (Hive) as the candidate platforms. In this work we chose the Blocking technique to implement Entity Resolution in the four applications mentioned above. We performed a number of experiments with different configurations to find the most efficient platform by analyzing, comparing and evaluating the results in detail.
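The abstract names the Blocking technique without describing it. As a rough illustration only: blocking groups records by a cheap key and compares records only within the same group, rather than comparing every pair. The sketch below is plain Python, not the Spark or Hive implementations evaluated in the paper. The blocking key (first three letters of the name), the record fields and the sample data are all hypothetical assumptions.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key (an assumption, not from the paper):
    # first three letters of the trimmed, lower-cased name.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    # Group record ids into blocks that share the same key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    # Only records inside the same block become candidate pairs.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Made-up sample records for illustration.
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jon Smith"},
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "maria garcia "},
    {"id": 5, "name": "Wei Chen"},
]

pairs = candidate_pairs(records)
all_pairs = list(combinations(range(len(records)), 2))
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

Each candidate pair would then go to a more expensive similarity check. The saving comes from skipping comparisons across blocks, at the cost of missing duplicates whose keys differ.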
Similar resources
Don't cry over spilled records: Memory elasticity of data-parallel applications and its application to cluster scheduling
Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory elasticity, an intrinsic property of data-parallel tasks. Memory elasticity allows tasks to run with significantly less memory than they would ideally want while onl...
Roaring bitmaps: Implementation of an optimized software library
Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix’s Atlas, LinkedIn’s Pivot, Metamarkets’ Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Servic...
Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments
BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases —queries— which require a broad combination of data extraction techniques including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning to fulfill them. However, currently, there is no widespread knowledge of the different resource require...
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
In-Stream Big Data Processing
The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that realtime query processing and in-stream processing is the immediate need in many practical applications. In recent years, this idea got a lot of traction and a whole bunch of solutions like Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, A...